Rescuing biologically relevant consensus regions across replicated samples

Background Protein-DNA binding sites of ChIP-seq experiments are identified where the binding affinity is significant based on a given threshold. The choice of the threshold is a trade-off between conservative region identification and discarding weak, but true binding sites. Results We rescue weak binding sites using MSPC, which efficiently exploits replicates to lower the threshold required to identify a site while keeping a low false-positive rate, and we compare it to IDR, a widely used post-processing method for identifying highly reproducible peaks across replicates. We observe several master transcription regulators (e.g., SP1 and GATA3) and HDAC2-GATA1 regulatory networks on rescued regions in K562 cell line. Conclusions We argue the biological relevance of weak binding sites and the information they add when rescued by MSPC. An implementation of the proposed extended MSPC methodology and the scripts to reproduce the performed analysis are freely available at https://genometric.github.io/MSPC/; MSPC is distributed as a command-line application and an R package available from Bioconductor (https://doi.org/doi:10.18129/B9.bioc.rmspc). Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05340-x.


Supplementary Figures
We then consider the conditional probability of given , i.e. the probability of a nucleotide to contain an annotation, conditionally of overlapping or non overlapping a peak. In particular, we define = ( = 1 | = 1) the probability of a nucleotide overlapping a peak to contain an annotation, and = ( = 1 | = 0) the probability of a nucleotide non-overlapping a peak to contain an annotation. We are interested in comparing and . We estimate the former with ̂= 1 / , where 1 is the coverage of annotations within peaks (i.e. for which = 1) and is the coverage of the union of all peaks. Similarly, we estimate the latter with ̂= 0 /( − ), where 0 is the coverage of annotations outside peaks (i.e. for which = 0) and − is the amount of genome non-overlapping any peak.
Functional enrichment is expressed as the peak caller/rescuer capability of finding as many annotations as possible with the smallest possible number of peak calls. A large functional enrichment is achieved when = ( = 1 | = 1) is higher than = ( = 1 | = 0), therefore we model it as the proportion difference: = − .

Functional Enrichment
Functional enrichment ( ) is defined as the difference between the proportion of the genome converted by peaks that overlap annotations (p) and the proportion of the genome without peaks that overlap annotations (a): Estimates: Where G is the genome size (bp count), U is the coverage of peaks (in bp), 0 is the coverage of annotations outside peaks (in bp), and 1 is the coverage of annotations within peaks (in bp).

Functional enrichment test
We test functional enrichment under the null hypothesis 0 : = 0, versus the alternative 1 : ≠ 0. Note that, although we are generally interested in testing for positive enrichment (i.e. > 0) for a given genomic feature, there are no strong assumptions allowing us to employ a one-sided test. For instance, under specific experimental conditions that may cause genome-wide repression of certain genomic elements, the peak caller/rescuer may find only a small number of active genomic regions (i.e., few peaks) and show feature under-representation (i.e., < 0). Significant feature under-representation could also arise within difficult genomic regions, such as repetitive or low complexity DNA, where it is hard to obtain reliable estimates of the signal-to-noise ratio and hence of peak presence.
Enrichment significance depends on the estimated effect size, i.e. on the difference − 0 where 0 is the null enrichment being tested (in our case, 0 = 0 ), and on its standard error ( ). The standard error measures the reliability of the estimated effect size, depending on the proportions we are comparing and the sample size. Under the null hypothesis 0 : = 0 we have = , so we can compute the pooled sample proportion as ̂= (̂+̂( − ))/( + ( − )) = / and employ it to estimate as ̂= √̂(1 −̂)(1/ + 1/( − )). With constant , as ̂ approaches 0.5 the standard error is maximized. Conversely, if either ̂ or 1 −̂ is a very small value, ̂(1 −̂) is small and so is the standard error. In our case, when the number of annotations is large (i.e. is large) we obtain a small ̂. This happens because a peak corresponding to a regulatory (e.g., histone marker) or processing (e.g., RNA Polymerase II binding) signal, thus being neither a technical artifact nor random background fluctuation, is generally associated with a genomic region having a functional role (either active or repressed, depending on the signal type), that is more likely found within an annotation. Therefore, we expect a larger fraction of enriched peaks in a frequent genomic annotation (e.g., promoters), rather than a rare one (e.g., snoRNA), for which we expect higher values.
Under the null hypothesis, is approximately distributed as a standard Normal (0,1), hence we can compute the p-value as the probability of observing a value as extreme as or more extreme than | |in a standard Normal distribution, and we can compute the 95% confidence interval for as 95% = (̂− 1.96 ̂; ̂+ 1.96 ̂). The 95% containing = 0 (i.e., the null enrichment) is a synonym of non-significant enrichment at level 5%. replicates, MSPC scalability was benchmarked using additional artificial replicates (i.e., replicates 3 to 30) generated by randomly altering the two available replicates.

TFB motif enrichment analysis at enhancers
The benchmarking results are given in Supplementary Figures 1-2, and show up to about 800x improvement in the runtime, as well as the scalability of the current MSPC version (both in runtime and memory usage) with respect to the number of replicates.
The scripts we employed for running the benchmarking are publicly distributed along with MSPC and available from https://github.com/Genometric/MSPC. Additionally, we have developed a Jupyter Notebook that runs in a Colaboratory environment, and publicly distributed it on MSPC's GitHub repository.

MSPC enrichment-based assessment in MCF7 cell line
We considered the ChIP-seq experiments for the transcription factors HDAC2, NRF1, and DDX20 available from ENCODE on MCF7 cells (see Supplementary Table 3). These three transcription factors were selected as examples of highly, moderately, and poorly biologically-enriched MSPCspecific rescued peaks, respectively, in the main K562 analysis presented in the paper (their ranking can be verified in Figure 2). HDAC2 is a master epigenetic regulator, for which we already presented an extended example of a rescued network in K562 cell (see Results). Regarding DDX20, the DEAD-box proteins have RNA-helicase activities that are not fully understood.
Finally, NRF1 has a role as a respiratory metabolism regulator.
For each of these three TFs, we pre-processed the data following the same pipeline we employed for K562 experiments (see Material and Methods). Then, we run the extended version of MSPC